Introduction to the NBA Player Statistics Dataset

The NBA Player Statistics dataset [1] provides a comprehensive compilation of performance metrics for players across multiple seasons of the National Basketball Association (NBA). This dataset encompasses a wide range of statistics that reflect player performance in various aspects of the game, including offensive and defensive skills.

Dataset Overview:

Scope: The data captures detailed player statistics such as points per game, assists, rebounds, steals, and blocks, among others. This allows for a multifaceted analysis of player contributions and effectiveness on the court.

Utility: Analysts and enthusiasts can use this dataset to evaluate player performance trends, develop predictive models for future performances, and compare players across different seasons and team compositions.

Applications: Beyond individual player analysis, the dataset serves as a foundational tool for team strategy development, game analytics, and in-depth research into the dynamics of professional basketball.

Assumptions and Overview

Positions in basketball:

This analysis reduces the position labels to the five standard basketball positions: Center (C), Power Forward (PF), Point Guard (PG), Small Forward (SF), and Shooting Guard (SG). Some players play more than one position and are labelled position1-position2 in the raw data (for example, SG-SF). For these players we keep only the primary position, position1.

The following is what the columns of data stand for:

  • Rk: Rank of the player (integer)

  • Player: Name of the player (character)

  • Pos: Position of the player (factor with 5 levels: “C”, “PF”, “PG”, “SF”, “SG”)

  • Age: Age of the player (integer)

  • Tm: Team of the player (factor with 38 levels)

  • G: Number of games the player was in (integer)

  • GS: Number of games the player started (integer)

  • MP: Minutes played per game (numeric)

  • FG: Field goals made per game (numeric)

  • FGA: Field goal attempts per game (numeric)

  • FG.: Field goal percentage (numeric)

  • X3P: 3-point field goals made per game (numeric)

  • X3PA: 3-point field goal attempts per game (numeric)

  • X3P.: 3-point field goal percentage (numeric)

  • X2P: 2-point field goals made per game (numeric)

  • X2PA: 2-point field goal attempts per game (numeric)

  • X2P.: 2-point field goal percentage (numeric)

  • eFG.: Effective field goal percentage (numeric)

  • FT: Free throws made per game (numeric)

  • FTA: Free throw attempts per game (numeric)

  • FT.: Free throw percentage (numeric)

  • ORB: Offensive rebounds per game (numeric)

  • DRB: Defensive rebounds per game (numeric)

  • TRB: Total rebounds per game (numeric)

  • AST: Assists per game (numeric)

  • STL: Steals per game (numeric)

  • BLK: Blocks per game (numeric)

  • TOV: Turnovers per game (numeric)

  • PF: Personal fouls per game (numeric)

  • PTS: Points per game (numeric)

  • Season: The season of the record (character)

  • MVP: Whether the player was the Most Valuable Player (factor with 2 levels: “False”, “True”)


Loading the Dataset

#data location

setwd("C:/Users/racha/Desktop/STAT 515")
#setwd("/Users/karar/Documents/Mason/DataAnalyticsMasters/STAT515/Final Project/STAT515_Final_Project/")



library(dplyr)
library(caret)
library(ggplot2)
library(GGally)
library(plotly)
library(tidyverse)
library(randomForest)
library(reshape2)

nba_data = read.csv("nba.csv")  # Ensure the file path is correct
#nba_data = read.csv("NBA_Player_Stats_2.csv") 

print(colnames(nba_data))
##  [1] "Rk"     "Player" "Pos"    "Age"    "Tm"     "G"      "GS"     "MP"    
##  [9] "FG"     "FGA"    "FG."    "X3P"    "X3PA"   "X3P."   "X2P"    "X2PA"  
## [17] "X2P."   "eFG."   "FT"     "FTA"    "FT."    "ORB"    "DRB"    "TRB"   
## [25] "AST"    "STL"    "BLK"    "TOV"    "PF"     "PTS"    "Season" "MVP"
summary(nba_data)
##        Rk           Player              Pos                 Age       
##  Min.   :  1.0   Length:14573       Length:14573       Min.   :18.00  
##  1st Qu.:124.0   Class :character   Class :character   1st Qu.:23.00  
##  Median :243.0   Mode  :character   Mode  :character   Median :26.00  
##  Mean   :244.3                                         Mean   :26.71  
##  3rd Qu.:361.0                                         3rd Qu.:30.00  
##  Max.   :605.0                                         Max.   :44.00  
##                                                                       
##       Tm                  G               GS              MP       
##  Length:14573       Min.   : 1.00   Min.   : 0.00   Min.   : 0.00  
##  Class :character   1st Qu.:22.00   1st Qu.: 0.00   1st Qu.:11.40  
##  Mode  :character   Median :48.00   Median : 7.00   Median :18.90  
##                     Mean   :45.54   Mean   :21.57   Mean   :19.62  
##                     3rd Qu.:70.00   3rd Qu.:39.00   3rd Qu.:27.70  
##                     Max.   :85.00   Max.   :83.00   Max.   :43.70  
##                                                                    
##        FG              FGA              FG.              X3P        
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 1.300   1st Qu.: 3.100   1st Qu.:0.3930   1st Qu.:0.0000  
##  Median : 2.400   Median : 5.500   Median :0.4350   Median :0.3000  
##  Mean   : 2.932   Mean   : 6.599   Mean   :0.4324   Mean   :0.5909  
##  3rd Qu.: 4.100   3rd Qu.: 9.200   3rd Qu.:0.4790   3rd Qu.:1.0000  
##  Max.   :12.200   Max.   :27.800   Max.   :1.0000   Max.   :5.3000  
##                                    NA's   :88                       
##       X3PA             X3P.             X2P              X2PA       
##  Min.   : 0.000   Min.   :0.0000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.100   1st Qu.:0.2220   1st Qu.: 1.000   1st Qu.: 2.100  
##  Median : 1.100   Median :0.3260   Median : 1.800   Median : 3.900  
##  Mean   : 1.704   Mean   :0.2843   Mean   : 2.342   Mean   : 4.895  
##  3rd Qu.: 2.800   3rd Qu.:0.3750   3rd Qu.: 3.300   3rd Qu.: 6.800  
##  Max.   :13.200   Max.   :1.0000   Max.   :12.100   Max.   :23.400  
##                   NA's   :2198                                      
##       X2P.             eFG.              FT              FTA        
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.4230   1st Qu.:0.4380   1st Qu.: 0.500   1st Qu.: 0.700  
##  Median :0.4700   Median :0.4830   Median : 1.000   Median : 1.400  
##  Mean   :0.4648   Mean   :0.4735   Mean   : 1.401   Mean   : 1.872  
##  3rd Qu.:0.5140   3rd Qu.:0.5240   3rd Qu.: 1.900   3rd Qu.: 2.500  
##  Max.   :1.0000   Max.   :1.5000   Max.   :10.300   Max.   :13.100  
##  NA's   :154      NA's   :88                                        
##       FT.              ORB            DRB              TRB       
##  Min.   :0.0000   Min.   :0.00   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:0.6600   1st Qu.:0.30   1st Qu.: 1.300   1st Qu.: 1.70  
##  Median :0.7500   Median :0.70   Median : 2.200   Median : 2.90  
##  Mean   :0.7262   Mean   :0.91   Mean   : 2.522   Mean   : 3.43  
##  3rd Qu.:0.8220   3rd Qu.:1.30   3rd Qu.: 3.300   3rd Qu.: 4.60  
##  Max.   :1.0000   Max.   :6.80   Max.   :12.000   Max.   :18.00  
##  NA's   :749                                                     
##       AST              STL              BLK              TOV       
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.: 0.500   1st Qu.:0.3000   1st Qu.:0.1000   1st Qu.:0.600  
##  Median : 1.200   Median :0.5000   Median :0.2000   Median :1.000  
##  Mean   : 1.758   Mean   :0.6215   Mean   :0.3902   Mean   :1.132  
##  3rd Qu.: 2.300   3rd Qu.:0.9000   3rd Qu.:0.5000   3rd Qu.:1.500  
##  Max.   :12.800   Max.   :3.0000   Max.   :6.0000   Max.   :5.700  
##                                                                    
##        PF             PTS            Season              MVP           
##  Min.   :0.000   Min.   : 0.000   Length:14573       Length:14573      
##  1st Qu.:1.200   1st Qu.: 3.400   Class :character   Class :character  
##  Median :1.800   Median : 6.400   Mode  :character   Mode  :character  
##  Mean   :1.782   Mean   : 7.853                                        
##  3rd Qu.:2.400   3rd Qu.:11.100                                        
##  Max.   :6.000   Max.   :36.100                                        
## 
str(nba_data)
## 'data.frame':    14573 obs. of  32 variables:
##  $ Rk    : int  1 2 3 4 4 4 5 6 7 8 ...
##  $ Player: chr  "Mahmoud Abdul-Rauf" "Tariq Abdul-Wahad" "Shareef Abdur-Rahim" "Cory Alexander" ...
##  $ Pos   : chr  "PG" "SG" "SF" "PG" ...
##  $ Age   : int  28 23 21 24 24 24 22 23 33 27 ...
##  $ Tm    : chr  "SAC" "SAC" "VAN" "TOT" ...
##  $ G     : int  31 59 82 60 37 23 82 66 50 61 ...
##  $ GS    : int  0 16 82 22 3 19 82 13 0 56 ...
##  $ MP    : num  17.1 16.3 36 21.6 13.5 34.7 40.1 27.9 8 30.5 ...
##  $ FG    : num  3.3 2.4 8 2.9 1.6 4.8 6.9 3.6 0.7 4.4 ...
##  $ FGA   : num  8.8 6.1 16.4 6.7 3.9 11.1 16 8.9 1.6 11 ...
##  $ FG.   : num  0.377 0.403 0.485 0.428 0.414 0.435 0.428 0.408 0.444 0.398 ...
##  $ X3P   : num  0.2 0.1 0.3 1.1 0.5 2 1.6 0.3 0 0.9 ...
##  $ X3PA  : num  1 0.3 0.6 2.9 1.7 4.9 4.5 1.3 0.1 2.6 ...
##  $ X3P.  : num  0.161 0.211 0.412 0.375 0.313 0.411 0.364 0.202 0 0.356 ...
##  $ X2P   : num  3.2 2.4 7.7 1.8 1.1 2.8 5.2 3.4 0.7 3.5 ...
##  $ X2PA  : num  7.8 5.7 15.8 3.7 2.2 6.2 11.5 7.6 1.5 8.4 ...
##  $ X2P.  : num  0.405 0.414 0.488 0.469 0.494 0.455 0.453 0.442 0.474 0.411 ...
##  $ eFG.  : num  0.386 0.409 0.493 0.51 0.483 0.525 0.479 0.422 0.444 0.44 ...
##  $ FT    : num  0.5 1.4 6.1 1.3 0.7 2.4 4.2 4.2 0.3 2.5 ...
##  $ FTA   : num  0.5 2.1 7.8 1.7 1 2.8 4.8 4.8 0.8 3.2 ...
##  $ FT.   : num  1 0.672 0.784 0.784 0.676 0.846 0.875 0.873 0.39 0.789 ...
##  $ ORB   : num  0.2 0.7 2.8 0.3 0.2 0.4 1.5 0.8 0.8 0.6 ...
##  $ DRB   : num  1 1.2 4.3 2.2 1.1 3.9 3.4 2 1.6 2.2 ...
##  $ TRB   : num  1.2 2 7.1 2.4 1.3 4.3 4.9 2.8 2.4 2.8 ...
##  $ AST   : num  1.9 0.9 2.6 3.5 1.9 6 4.3 3.4 0.3 5.7 ...
##  $ STL   : num  0.5 0.6 1.1 1.2 0.7 2 1.4 1.3 0.4 1.4 ...
##  $ BLK   : num  0 0.2 0.9 0.2 0.1 0.3 0.1 0.2 0.2 0 ...
##  $ TOV   : num  0.6 1.1 3.1 1.9 1.3 2.8 3.2 1.9 0.3 2.3 ...
##  $ PF    : num  1 1.4 2.5 1.6 1.4 2 3 2.1 1.7 2.2 ...
##  $ PTS   : num  7.3 6.4 22.3 8.1 4.5 14 19.5 11.7 1.8 12.2 ...
##  $ Season: chr  "1997-98" "1997-98" "1997-98" "1997-98" ...
##  $ MVP   : chr  "False" "False" "False" "False" ...

Data Cleaning

# Handling missing values
nba_data = na.omit(nba_data) 

# Pre-process the data: Convert factors
nba_data$Tm = as.factor(nba_data$Tm)
nba_data$MVP = as.factor(nba_data$MVP)

nba_data = nba_data %>% separate(Pos, into = c("Pos", "Pos2"), sep = "-")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 11764 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
nba_data = nba_data %>% subset(select = -Pos2)

nba_data$Pos = as.factor(nba_data$Pos)
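
The separate() call above works but emits a warning for every single-position row. As a minimal alternative sketch, the primary position can be extracted with base R's sub(), which drops the first hyphen and everything after it. The vector below is a hypothetical stand-in for the Pos column, not the actual data.

```r
# Hypothetical sample of raw position labels (stand-in for nba_data$Pos)
pos_raw <- c("PG", "SG-SF", "C-PF", "SF", "PF-C")

# Remove the first hyphen and everything after it, keeping position1
pos_primary <- sub("-.*$", "", pos_raw)

# Convert to a factor; levels are sorted alphabetically by default
pos_factor <- factor(pos_primary)

print(pos_primary)
print(levels(pos_factor))
```

Because the levels are sorted alphabetically, C becomes the first level, which matters later when the factor is used as a regression predictor.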

Question 1: How do key performance metrics such as field goal percentage, assists, and total rebounds influence a player’s scoring outcomes in NBA games, and how do these relationships vary by player position?

Data Filtering

nba_data_filtered = nba_data %>%
  filter(G > 20)  

After this step, nba_data_filtered contains only player-season records with more than 20 games played. Per-game averages computed over very few games are noisy, so this filter removes many of the extreme values caused by small sample sizes.

Correcting Percentage Fields

nba_data_filtered = nba_data_filtered %>%
  mutate(`FG%` = `FG.` / 100,
         `3P%` = `X3P.` / 100,
         `2P%` = `X2P.` / 100,
         `eFG%` = `eFG.` / 100,
         `FT%` = `FT.` / 100) %>%
  select(-c(`FG.`, `X3P.`, `X2P.`, `eFG.`, `FT.`))  

Interpretation:

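Once this snippet is executed, the dataset has renamed shooting columns (FG%, 3P%, 2P%, eFG%, FT%) and the original columns are dropped. Note, however, that the summary output above shows the source columns are already stored as proportions (for example, FG. has a median of 0.435 and a maximum of 1.0). Dividing by 100 therefore shrinks them by a further factor of 100 rather than converting percentages to proportions. The results below were produced with this rescaling: shooting averages appear as values near 0.005, and the regression coefficients on the shooting variables are 100 times larger than they would be on the proportion scale.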
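
The effect of dividing a predictor by 100 can be sketched on synthetic data. In the example below, x is a made-up variable playing the role of a shooting proportion, and all names are illustrative. Rescaling the predictor multiplies its slope by 100 but leaves the fitted values unchanged.

```r
# Synthetic data: x plays the role of a shooting proportion in [0, 1]
set.seed(515)
x <- runif(100, min = 0.30, max = 0.60)
y <- 2 + 10 * x + rnorm(100, sd = 0.5)

# Fit once on the proportion scale and once after dividing by 100
fit_prop   <- lm(y ~ x)
x_div100   <- x / 100
fit_div100 <- lm(y ~ x_div100)

slope_prop   <- unname(coef(fit_prop)["x"])
slope_div100 <- unname(coef(fit_div100)["x_div100"])

# The second slope is 100 times the first; the fitted values agree
print(c(slope_prop = slope_prop, slope_div100 = slope_div100))
print(all.equal(fitted(fit_prop), fitted(fit_div100)))
```

This is why the size of a shooting-percentage coefficient in the models below says more about the units of the predictor than about its importance.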

Creating Visualizations

library(ggplot2)

ggplot(nba_data, aes(x = PTS)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  labs(title = "Distribution of Points Per Game", x = "Points Per Game", y = "Frequency")

ggplot(nba_data, aes(x = Pos, y = PTS, fill = Pos)) +
  geom_boxplot() +
  labs(title = "Points Per Game by Player Position", x = "Position", y = "Points Per Game")

Histogram: Distribution of Points Per Game

The histogram shows that the distribution of points per game is right-skewed, meaning most players score on the lower end of the scale, with fewer players averaging high points per game. This is typical in sports data where only a few top performers reach the higher end of the scoring spectrum. The peak of the distribution is around 2 to 6 points per game, indicating that this range is the most common scoring output among players.

Boxplot: Points Per Game by Player Position

The boxplot reveals several interesting points about scoring across different positions:

  • Variability: There’s a notable variation in median points per game among positions. For instance, positions like Shooting Guard (SG) and Point Guard (PG) typically have higher medians and wider interquartile ranges, suggesting these positions are likely to score more.

  • Outliers: Several positions show outliers, especially in scoring roles like SG and PG, indicating some players in these positions significantly outscore their peers.

  • Positional Roles: The plot shows the differences in scoring roles within teams, where guards generally score more than forwards and centers. This could be indicative of the offensive responsibilities typically assigned to these positions in basketball.

These insights can be particularly useful for team strategy, indicating which positions might require more focus in training for scoring or recruitment to balance team capabilities. They also provide a foundational understanding for further statistical testing, such as comparing means across groups or correlating scoring with other factors like age or experience.

Multivariate Regression Analysis

reg_model = lm(PTS ~ `FG%` + `3P%` + `2P%` + `eFG%` + `FT%` + AST + TRB + Pos, data = nba_data_filtered)
summary(reg_model)
## 
## Call:
## lm(formula = PTS ~ `FG%` + `3P%` + `2P%` + `eFG%` + `FT%` + AST + 
##     TRB + Pos, data = nba_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6594  -1.8248  -0.1881   1.6152  17.4742 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -12.90504    0.37652 -34.274  < 2e-16 ***
## `FG%`        883.47516  125.94807   7.015 2.45e-12 ***
## `3P%`        276.85679   28.17634   9.826  < 2e-16 ***
## `2P%`       -387.96409   94.98483  -4.084 4.45e-05 ***
## `eFG%`       489.11420  117.26342   4.171 3.06e-05 ***
## `FT%`        964.34134   31.09663  31.011  < 2e-16 ***
## AST            1.51059    0.02458  61.447  < 2e-16 ***
## TRB            1.31314    0.01980  66.336  < 2e-16 ***
## PosPF          1.19888    0.11316  10.594  < 2e-16 ***
## PosPG          0.34426    0.16581   2.076   0.0379 *  
## PosSF          2.60046    0.12777  20.353  < 2e-16 ***
## PosSG          3.40874    0.14009  24.333  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.099 on 10032 degrees of freedom
## Multiple R-squared:  0.714,  Adjusted R-squared:  0.7137 
## F-statistic:  2277 on 11 and 10032 DF,  p-value: < 2.2e-16

Key Model Outputs:

  1. Coefficients and Significance:

    • FG%, 3P%, eFG%, FT%: Significant positive coefficients for these variables indicate that higher shooting efficiencies (field goal, three-point, effective field goal, and free throw percentages) are associated with higher points per game. The coefficients for FT% (964.3) and FG% (883.5) are numerically very large, but this reflects the scale of the predictors rather than the strength of the effect: the shooting columns were divided by 100 before fitting, so a one-unit change in these variables is far outside the observed range.

    • 2P%: Interestingly, this has a significant negative coefficient (-388.0). Because FG%, 2P%, and eFG% are computed from overlapping shot counts, the sign flip is most plausibly a symptom of multicollinearity rather than evidence that two-point accuracy lowers scoring.

    • Assists (AST) and Total Rebounds (TRB): Both have significant positive coefficients (about 1.51 and 1.31 points per game for each additional assist or rebound per game, holding the other predictors fixed), underscoring the value of players who contribute beyond just shooting.

  2. Player Position (Pos):

    • The baseline position is Center (C), the first factor level, which is why it has no coefficient of its own. All four remaining positions have positive coefficients relative to centers: SG (3.41), SF (2.60), PF (1.20), and PG (0.34). The PG difference is only marginally significant (p = 0.038).

    • These coefficients are differences after adjusting for shooting percentages, assists, and rebounds, so they describe scoring at comparable levels of those statistics rather than raw positional averages.

  3. Model Fit:

    • R-squared (0.714): About 71.4% of the variability in points per game is explained by the model, which suggests a good fit.

    • Adjusted R-squared (0.7137): This is very close to the R-squared value, indicating that the number of predictors in the model is justified given the amount of data.
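
The multicollinearity suspected for 2P% can be checked with variance inflation factors (VIFs). The sketch below computes them by hand in base R as 1 / (1 - R²), where R² comes from regressing each predictor on the others. The data are synthetic and the column names (fg, efg, ast) are illustrative stand-ins, not the report's columns.

```r
# Synthetic predictors: efg is built to be nearly collinear with fg
set.seed(1)
n <- 200
fg  <- runif(n, 0.35, 0.55)
efg <- fg + rnorm(n, sd = 0.01)
ast <- runif(n, 0, 8)
X <- data.frame(fg, efg, ast)

# VIF for each predictor: regress it on all the others, then 1 / (1 - R^2)
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif, 2))
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign that coefficient signs and magnitudes are unstable. Applying the same calculation to the shooting columns would show whether that holds for the model above.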

Grouping and Summarizing Data by Position

position_summary = nba_data_filtered %>%
  group_by(Pos) %>%
  summarise(
    Avg_Points = mean(PTS),
    Avg_AST = mean(AST),
    Avg_TRB = mean(TRB),
    Avg_FG_Percentage = mean(`FG%`),
    Avg_3P_Percentage = mean(`3P%`),
    Avg_2P_Percentage = mean(`2P%`),
    Avg_eFG_Percentage = mean(`eFG%`),
    Avg_FT_Percentage = mean(`FT%`)
  )

print(position_summary)
## # A tibble: 5 × 9
##   Pos   Avg_Points Avg_AST Avg_TRB Avg_FG_Percentage Avg_3P_Percentage
##   <fct>      <dbl>   <dbl>   <dbl>             <dbl>             <dbl>
## 1 C           8.96    1.26    6.05           0.00502           0.00171
## 2 PF          9.43    1.42    5.16           0.00457           0.00252
## 3 PG          9.50    4.02    2.50           0.00417           0.00324
## 4 SF          9.59    1.68    3.69           0.00433           0.00322
## 5 SG         10.1     2.13    2.76           0.00421           0.00339
## # ℹ 3 more variables: Avg_2P_Percentage <dbl>, Avg_eFG_Percentage <dbl>,
## #   Avg_FT_Percentage <dbl>

Interpretation:

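The table shows the average points, assists, rebounds, and shooting percentages for each position. The shooting averages appear as values near 0.005 because the percentage columns were divided by 100 earlier. The table provides insights into:

  • Offensive and Defensive Roles: Point guards average the most assists (4.02) and centers the most rebounds (6.05), while average scoring is similar across positions, ranging from 8.96 (C) to 10.1 (SG).

  • Shooting Efficiency: Centers have the highest average field goal percentage and the lowest three-point percentage, consistent with taking most of their shots close to the basket. Guards and small forwards show the reverse pattern.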

Extracting and Visualizing Regression Coefficients

coef_data = as.data.frame(summary(reg_model)$coefficients)


ggplot(coef_data, aes(x = rownames(coef_data), y = Estimate, fill = Estimate)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Regression Coefficients: Predicting Points Per Game", x = "Predictors", y = "Coefficient Estimate")

Visualization Interpretation:

The chart displays the coefficient estimates from the regression model, including the intercept. Because the predictors are on very different scales (the shooting variables were divided by 100, while assists and rebounds are per-game counts), bar length reflects units as much as importance. Here are some key observations:

  1. Large Positive Coefficients:

    • FT% (Free Throw Percentage): Shows the largest positive coefficient, indicating that higher free throw percentage is associated with more points per game.

    • FG% (Field Goal Percentage): Also has a large positive coefficient, which is expected as making more field goals directly contributes to higher scoring.

    • eFG% (Effective Field Goal Percentage): Positive but smaller than FG% and FT%, possibly because it overlaps with FG% and 3P% by construction (it gives extra weight to three-point field goals).

  2. Negative Coefficient:

    • 2P% (Two-Point Percentage): This predictor shows a negative coefficient, which could suggest a substitution effect with other types of shots (like three-pointers) or may indicate multicollinearity with other shooting percentage variables.

  3. Positions:

    • The position terms (PosPF, PosPG, PosSF, PosSG) are differences from the Center baseline. Shooting Guard (PosSG) and Small Forward (PosSF) show the largest positive coefficients, indicating that these positions, typically scoring roles, are likely to score more points at comparable levels of the other statistics.

  4. Other Stats:

    • Assists (AST) and Total Rebounds (TRB): Both have positive coefficients that look small on the chart next to the shooting percentages, but the comparison is not like for like. A one-unit change in assists or rebounds is a realistic difference between players, whereas a one-unit change in a rescaled shooting percentage is not. Their t values (61.4 and 66.3) are the largest in the model.
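
One way to make coefficient bars comparable is to standardize every variable before fitting, so each coefficient is expressed in standard deviations. The sketch below uses synthetic data with hypothetical names (shoot_pct, assists, pts) and a small helper, zscore(), defined here for illustration. It shows how a predictor with a huge raw coefficient can have the smaller standardized one.

```r
# Synthetic data with predictors on very different scales
set.seed(1)
n <- 300
shoot_pct <- runif(n, 0.003, 0.006)  # tiny scale, like a proportion divided by 100
assists   <- runif(n, 0, 10)         # per-game count scale
pts <- 1000 * shoot_pct + 1.5 * assists + rnorm(n, sd = 2)

# Helper (illustrative): centre to mean 0 and scale to standard deviation 1
zscore <- function(v) (v - mean(v)) / sd(v)

raw_fit <- lm(pts ~ shoot_pct + assists)
std_fit <- lm(zscore(pts) ~ zscore(shoot_pct) + zscore(assists))

# Drop the intercepts and compare slopes on the two scales
raw_coefs <- unname(coef(raw_fit)[-1])
std_coefs <- unname(coef(std_fit)[-1])

print(rbind(raw = raw_coefs, standardized = std_coefs))
```

Each standardized slope equals the raw slope multiplied by sd(x) / sd(y), so plotting standardized coefficients (or t values) gives a fairer visual ranking than plotting raw estimates.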

Calculating and Visualizing Correlation Matrix

cor_matrix = cor(nba_data_filtered[, c("PTS", "AST", "TRB", "FG%", "3P%", "2P%", "eFG%", "FT%")])


library(reshape2)
melted_cor_matrix = melt(cor_matrix)
ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title = element_blank())

Interpretation of the Correlation Matrix Heatmap:

  1. Points (PTS) Correlations:

    • High positive correlations with field goal percentage (FG%), effective field goal percentage (eFG%), and free throw percentage (FT%). This indicates that players who have higher shooting efficiencies tend to score more points.

    • Moderate positive correlation with assists (AST) and total rebounds (TRB), suggesting that players who are more involved in the game (either through passing or rebounding) also tend to score more.

  2. Assists (AST):

    • Shows positive correlations with FG%, 3P%, and eFG%. This could imply that players who assist more are involved in plays that lead to effective shooting, possibly indicating good playmaking leads to more efficient scoring opportunities.

  3. Total Rebounds (TRB):

    • Positively correlated with FG% and 2P% but less so with 3P%. This might reflect that players who are good at rebounding are often in positions to make two-point shots (perhaps due to being closer to the basket).

  4. Shooting Percentages (FG%, 3P%, 2P%, eFG%, FT%):

    • The correlations among different types of shooting percentages are generally high, which is expected as they are not independent of each other. For instance, eFG%, which accounts for the fact that three-point field goals count more than two-point field goals, is highly correlated with both FG% and 3P%.

    • Free throw percentage (FT%) shows strong correlations with FG% and eFG%, suggesting that players who are good shooters generally perform well across different types of shooting.

  5. Negative Correlations:

    • There are few if any strong negative correlations visible in the heatmap, indicating that these performance metrics generally do not inhibit each other. Any light blue cells represent only mild inverse relationships.
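
The dependence among the shooting percentages noted in item 4 follows from how they are defined. Effective field goal percentage is conventionally computed as (FG + 0.5 × 3P) / FGA, so it can never be lower than FG%. The sketch below applies that standard formula to the per-game values of the first record shown in the str() output. Because those inputs are rounded per-game figures, the result should only approximately match the listed FG. (0.377) and eFG. (0.386).

```r
# Per-game values for the first record shown in the str() output
fg  <- 3.3   # field goals made per game
fga <- 8.8   # field goal attempts per game
x3p <- 0.2   # three-point field goals made per game

# FG% counts every make equally; eFG% adds half a make per three-pointer
fg_pct  <- fg / fga
efg_pct <- (fg + 0.5 * x3p) / fga

print(round(c(fg_pct = fg_pct, efg_pct = efg_pct), 3))
```

Since eFG% is FG% plus a term driven by three-point makes, including FG%, 3P%, and eFG% in one regression gives the model nearly redundant information.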

Building a Predictive Model

predictive_model = lm(PTS ~ `FG%` + AST + TRB, data = nba_data_filtered)
summary(predictive_model)
## 
## Call:
## lm(formula = PTS ~ `FG%` + AST + TRB, data = nba_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4517  -2.1162  -0.5069   1.8058  20.3415 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.006375   0.294076  -0.022    0.983    
## `FG%`       348.506763  71.345737   4.885 1.05e-06 ***
## AST           1.635263   0.020020  81.683  < 2e-16 ***
## TRB           1.158106   0.018098  63.992  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.656 on 10040 degrees of freedom
## Multiple R-squared:  0.6015, Adjusted R-squared:  0.6014 
## F-statistic:  5052 on 3 and 10040 DF,  p-value: < 2.2e-16

Interpretation:

  1. Model Fit:

    • Residual Standard Error: The residual standard error is 3.656, which indicates the typical deviation of the observed points scored from the predicted points by the model. A lower value would suggest a tighter fit.

    • R-squared: 0.6015 suggests that approximately 60.15% of the variability in points scored is explained by the model. This is a reasonably good fit for a model in a complex, dynamic setting like sports.

    • Adjusted R-squared: 0.6014 is very close to the R-squared value, indicating that the predictors are relevant and the model is not overly complex for the amount of data.

    • F-statistic: The very low p-value (< 2.2e-16) associated with the F-statistic (5052 on 3 and 10040 degrees of freedom) confirms that the model is statistically significant and that at least one of the predictors has a significant relationship with the points scored.
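
The fit statistics above are computed on the same rows used to fit the model. A minimal sketch of an out-of-sample check, assuming a simple random train/test split, is shown below. The data frame toy and its coefficients are synthetic stand-ins, not the NBA data.

```r
# Synthetic stand-in for the per-game data
set.seed(2024)
n <- 500
toy <- data.frame(
  AST = runif(n, 0, 10),
  TRB = runif(n, 0, 12)
)
toy$PTS <- 1.6 * toy$AST + 1.2 * toy$TRB + rnorm(n, sd = 3)

# Hold out 100 rows for testing and fit on the remaining 400
test_idx <- sample(n, size = 100)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit  <- lm(PTS ~ AST + TRB, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on rows the model has not seen
rmse <- sqrt(mean((test$PTS - pred)^2))
print(rmse)
```

If the held-out RMSE is close to the residual standard error from summary(), the in-sample fit is a fair guide to predictive accuracy.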

Integrating Shiny for Interactive Visualization

library(shiny)
## Warning: package 'shiny' was built under R version 4.3.3
library(ggplot2)
library(dplyr)


ui = fluidPage(
  titlePanel("NBA Player Statistics Explorer"),
  sidebarLayout(
    sidebarPanel(
      selectInput("stat", "Choose a statistic:",
                  choices = c("Points" = "PTS", "Assists" = "AST", "Rebounds" = "TRB")),
      sliderInput("number_of_games", "Minimum Number of Games:",
                  min = 0, max = 82, value = 20),
      selectInput("modelType", "Choose Model Type:",
                  choices = c("Basic Model", "Full Model"))
    ),
    mainPanel(
      tabsetPanel(type = "tabs",
        tabPanel("Plot", plotOutput("statPlot")),
        tabPanel("Regression Output", verbatimTextOutput("modelOutput")),
        tabPanel("Summary Statistics", tableOutput("summaryStats"))
         
      )
    )
  )
)


server = function(input, output) {
  # Dynamic plot based on user input
  output$statPlot = renderPlot({
    filtered_data = nba_data %>%
      filter(G >= input$number_of_games)
    ggplot(filtered_data, aes(x = Pos, y = .data[[input$stat]], fill = Pos)) +
      geom_boxplot() +
      labs(title = paste(input$stat, "Per Game by Player Position"), x = "Position", y = input$stat) +
      theme_minimal()
  })
  
  
  output$modelOutput = renderPrint({
    if (input$modelType == "Basic Model") {
      summary(lm(PTS ~ `FG%` + AST + TRB, data = nba_data_filtered))
    } else {
      summary(lm(PTS ~ `FG%` + `3P%` + `2P%` + `eFG%` + `FT%` + AST + TRB + Pos, data = nba_data_filtered))
    }
  })
  
  
  output$summaryStats = renderTable({
    nba_data_filtered %>%
      group_by(Pos) %>%
      summarise(
        Avg_Points = mean(PTS),
        Avg_AST = mean(AST),
        Avg_TRB = mean(TRB),
        Avg_FG_Percentage = mean(`FG%`),
        Avg_3P_Percentage = mean(`3P%`)
      )
  })
  
}

# Run the application 
shinyApp(ui = ui, server = server)

The regression summary for the predictive model above gives the following results and accuracy metrics for the effect of field goal percentage (FG%), assists (AST), and total rebounds (TRB) on points scored (PTS):

  1. Model Effectiveness:

    • The model explains approximately 60.15% of the variance in points scored (R-squared = 0.6015). This is a substantial proportion, indicating that the model has a good level of predictive power considering the complexity and variability inherent in sports performance data.

  2. Predictor Significance:

    • Assists (AST): Highly significant (p < 2e-16) with a coefficient of 1.635263, suggesting a strong positive association with scoring. Each additional assist per game is associated with an increase of approximately 1.64 points per game.

    • Total Rebounds (TRB): Also highly significant (p < 2e-16) with a coefficient of 1.158106, indicating that each additional rebound per game is associated with about 1.16 more points per game.

  3. Field Goal Percentage (FG%):

    • Statistically significant in this model (p = 1.05e-06), with a coefficient of 348.506763 and a standard error of 71.345737. The estimate is far less precise than those for assists and rebounds (t = 4.885, compared with 81.683 and 63.992). Its size reflects the earlier division of FG% by 100: on the original proportion scale it corresponds to about 3.49 points per game per unit of FG%, or roughly 0.035 points for each additional percentage point.

  4. Model Fit and Accuracy:

    • Residual Standard Error: 3.656 on 10040 degrees of freedom, indicating the typical difference between the observed scores and the scores predicted by the model is about 3.7 points. This level of error is reasonable given the variability in individual player performances in basketball.

  5. Statistical Power and Reliability:

    • The F-statistic is 5052 on 3 and 10040 degrees of freedom with a p-value of less than 2.2e-16, demonstrating that the relationships the model describes are highly unlikely to be due to random variation.

These results show that assists and rebounds are strong predictors of points scored. Field goal percentage is also significant but estimated much less precisely, so its role might need further investigation, possibly by examining other factors that interact with or confound the relationship between shooting efficiency and scoring.


Question 2: Does the distribution of steals per game (STL) vary significantly across different player positions during the 2019-2020 NBA season?

Initial Data Load and Examination of Positions

setwd("C:/Users/racha/Desktop/STAT 515")
nba_data = read.csv("nba.csv")

unique_positions = unique(nba_data$Pos)

print(unique_positions)
##  [1] "PG"    "SG"    "SF"    "C"     "PF"    "SG-SF" "SG-PG" "PF-C"  "SF-SG"
## [10] "SF-PF" "PF-SF" "C-PF"  "PG-SG" "PG-SF" "SG-PF" "SF-C"

NA Check and Data Distribution for ‘STL’

library(dplyr)
library(tidyr)

na_count = nba_data %>%
  filter(Season == "2019-20") %>%
  summarise(NA_in_STL = sum(is.na(STL)))

print(na_count)
##   NA_in_STL
## 1         0
stl_distribution = nba_data %>%
  filter(Season == "2019-20", !is.na(STL)) %>%
  summarise(
    Min_STL = min(STL),
    Max_STL = max(STL),
    Mean_STL = mean(STL)
  )

print(stl_distribution)
##   Min_STL Max_STL  Mean_STL
## 1       0     2.1 0.6160436
nba_filtered = nba_data %>%
  filter(Season == "2019-20", !is.na(STL)) %>%
  select(Pos, STL,Age)


nba_filtered$Pos = factor(nba_filtered$Pos)

str(nba_filtered)
## 'data.frame':    642 obs. of  3 variables:
##  $ Pos: Factor w/ 14 levels "C","C-PF","PF",..: 1 3 1 1 12 12 1 6 3 12 ...
##  $ STL: num  0.8 1.1 0.7 0 0.4 0.3 0.6 0.5 1 0 ...
##  $ Age: int  26 22 34 23 21 24 21 27 29 26 ...
head(nba_filtered)
##   Pos STL Age
## 1   C 0.8  26
## 2  PF 1.1  22
## 3   C 0.7  34
## 4   C 0.0  23
## 5  SG 0.4  21
## 6  SG 0.3  24

This code checks data quality by counting missing values in the STL column for the 2019-20 season (there are none) and summarizes steals per game: the minimum is 0, the maximum is 2.1, and the mean is about 0.62. It then keeps only the Pos, STL, and Age columns and converts Pos to a factor. Because the data were reloaded from nba.csv, Pos here has 14 levels, including hybrid labels such as C-PF, rather than the five primary positions used in Question 1.

library(ggplot2)


ggplot(nba_filtered, aes(x = Pos, y = STL, fill = Pos)) +
  geom_boxplot() +
  labs(title = "Distribution of Steals Per Game by Position",
       x = "Position",
       y = "Steals Per Game") +
  theme_minimal()

anova_results = aov(STL ~ Pos, data = nba_filtered)


anova_summary = summary(anova_results)
print(anova_summary)
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## Pos          13   7.53  0.5794   3.729 9.3e-06 ***
## Residuals   628  97.59  0.1554                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This ANOVA test [2] indicates that mean steals per game differ across position groups (F(13, 628) = 3.729, p = 9.3e-06), so at least one position averages more or fewer steals than another. The test does not say which positions differ; that requires a post-hoc comparison. The finding is relevant for understanding defensive roles and can inform coaching strategies, player development, and game tactics based on positional roles.
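A significant F test does not by itself indicate how large the position effect is. As a rough supplement, the sketch below recomputes the F statistic from the rounded sums of squares printed in the ANOVA table and derives eta squared, the share of variance in steals associated with position. The variable names are illustrative, and because the inputs are rounded the recomputed F differs slightly from the printed 3.729.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_pos <- 7.53    # sum of squares for Pos
df_pos <- 13      # degrees of freedom for Pos
ss_res <- 97.59   # residual sum of squares
df_res <- 628     # residual degrees of freedom

# F statistic: ratio of the two mean squares
f_value <- (ss_pos / df_pos) / (ss_res / df_res)

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_pos, df_res, lower.tail = FALSE)

# Eta squared: proportion of total variance associated with Pos
eta_sq <- ss_pos / (ss_pos + ss_res)

round(c(F = f_value, eta_sq = eta_sq), 3)
p_value
```

With these inputs eta squared is roughly 0.07, which suggests that position accounts for a modest share of the variation in steals even though the test is highly significant.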

tukey_results = TukeyHSD(anova_results)
print(tukey_results)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = STL ~ Pos, data = nba_filtered)
## 
## $Pos
##                      diff          lwr         upr     p adj
## C-PF-C      -2.159091e-01 -1.548206908  1.11638873 0.9999994
## PF-C         1.693762e-02 -0.144941689  0.17881694 1.0000000
## PF-C-C       1.440909e-01 -0.460624141  0.74880596 0.9999402
## PF-SF-C     -1.590909e-02 -0.790873492  0.75905531 1.0000000
## PG-C         2.927448e-01  0.118718479  0.46677103 0.0000020
## PG-SG-C     -1.159091e-01 -1.448206908  1.21638873 1.0000000
## SF-C         1.877273e-01  0.016376197  0.35907835 0.0172595
## SF-C-C      -2.159091e-01 -1.548206908  1.11638873 0.9999994
## SF-PF-C      1.507576e-01 -0.624206825  0.92572198 0.9999944
## SF-SG-C      1.840909e-01 -0.761520922  1.12970274 0.9999944
## SG-C         7.337662e-02 -0.087649349  0.23440260 0.9609481
## SG-PG-C     -1.159091e-01 -1.061520922  0.82970274 1.0000000
## SG-SF-C     -1.159091e-01 -1.448206908  1.21638873 1.0000000
## PF-C-PF      2.328467e-01 -1.099268293  1.56496172 0.9999985
## PF-C-C-PF    3.600000e-01 -1.093962095  1.81396210 0.9999074
## PF-SF-C-PF   2.000000e-01 -1.332610617  1.73261062 1.0000000
## PG-C-PF      5.086538e-01 -0.824991769  1.84229946 0.9916208
## PG-SG-C-PF   1.000000e-01 -1.777056994  1.97705699 1.0000000
## SF-C-PF      4.036364e-01 -0.929662805  1.73693553 0.9991607
## SF-C-C-PF   -4.718448e-15 -1.877056994  1.87705699 1.0000000
## SF-PF-C-PF   3.666667e-01 -1.165943951  1.89927728 0.9999374
## SF-SG-C-PF   4.000000e-01 -1.225579041  2.02557904 0.9999137
## SG-C-PF      2.892857e-01 -1.042725865  1.62129729 0.9999795
## SG-PG-C-PF   1.000000e-01 -1.525579041  1.72557904 1.0000000
## SG-SF-C-PF   1.000000e-01 -1.777056994  1.97705699 1.0000000
## PF-C-PF      1.271533e-01 -0.477158896  0.73146547 0.9999859
## PF-SF-PF    -3.284672e-02 -0.807496793  0.74180336 1.0000000
## PG-PF        2.758071e-01  0.103185971  0.44842829 0.0000094
## PG-SG-PF    -1.328467e-01 -1.464961723  1.19926829 1.0000000
## SF-PF        1.707896e-01  0.000865809  0.34071349 0.0474094
## SF-C-PF     -2.328467e-01 -1.564961723  1.09926829 0.9999985
## SF-PF-PF     1.338200e-01 -0.640830126  0.90847003 0.9999987
## SF-SG-PF     1.671533e-01 -0.778200964  1.11250753 0.9999982
## SG-PF        5.643900e-02 -0.103067376  0.21594537 0.9958987
## SG-PG-PF    -1.328467e-01 -1.078200964  0.81250753 0.9999999
## SG-SF-PF    -1.328467e-01 -1.464961723  1.19926829 1.0000000
## PF-SF-PF-C  -1.600000e-01 -1.129308063  0.80930806 0.9999992
## PG-PF-C      1.486538e-01 -0.459024888  0.75633258 0.9999193
## PG-SG-PF-C  -2.600000e-01 -1.713962095  1.19396210 0.9999980
## SF-PF-C      4.363636e-02 -0.563281663  0.65055439 1.0000000
## SF-C-PF-C   -3.600000e-01 -1.813962095  1.09396210 0.9999074
## SF-PF-PF-C   6.666667e-03 -0.962641397  0.97597473 1.0000000
## SF-SG-PF-C   4.000000e-02 -1.070481893  1.15048189 1.0000000
## SG-PF-C     -7.071429e-02 -0.674798438  0.53336987 1.0000000
## SG-PG-PF-C  -2.600000e-01 -1.370481893  0.85048189 0.9999511
## SG-SF-PF-C  -2.600000e-01 -1.713962095  1.19396210 0.9999980
## PG-PF-SF     3.086538e-01 -0.468625367  1.08593306 0.9878809
## PG-SG-PF-SF -1.000000e-01 -1.632610617  1.43261062 1.0000000
## SF-PF-SF     2.036364e-01 -0.573048271  0.98032100 0.9998239
## SF-C-PF-SF  -2.000000e-01 -1.732610617  1.33261062 1.0000000
## SF-PF-PF-SF  1.666667e-01 -0.917052694  1.25038603 0.9999997
## SF-SG-PF-SF  2.000000e-01 -1.011635079  1.41163508 0.9999992
## SG-PF-SF     8.928571e-02 -0.685186489  0.86375792 1.0000000
## SG-PG-PF-SF -1.000000e-01 -1.311635079  1.11163508 1.0000000
## SG-SF-PF-SF -1.000000e-01 -1.632610617  1.43261062 1.0000000
## PG-SG-PG    -4.086538e-01 -1.742299462  0.92499177 0.9990467
## SF-PG       -1.050175e-01 -0.286550797  0.07651583 0.7991152
## SF-C-PG     -5.086538e-01 -1.842299462  0.82499177 0.9916208
## SF-PF-PG    -1.419872e-01 -0.919266393  0.63529203 0.9999974
## SF-SG-PG    -1.086538e-01 -1.056163682  0.83885599 1.0000000
## SG-PG       -2.193681e-01 -0.391189308 -0.04754696 0.0016213
## SG-PG-PG    -4.086538e-01 -1.356163682  0.53885599 0.9751039
## SG-SF-PG    -4.086538e-01 -1.742299462  0.92499177 0.9990467
## SF-PG-SG     3.036364e-01 -1.029662805  1.63693553 0.9999645
## SF-C-PG-SG  -1.000000e-01 -1.977056994  1.77705699 1.0000000
## SF-PF-PG-SG  2.666667e-01 -1.265943951  1.79927728 0.9999986
## SF-SG-PG-SG  3.000000e-01 -1.325579041  1.92557904 0.9999970
## SG-PG-SG     1.892857e-01 -1.142725865  1.52129729 0.9999999
## SG-PG-PG-SG -2.775558e-15 -1.625579041  1.62557904 1.0000000
## SG-SF-PG-SG -6.661338e-16 -1.877056994  1.87705699 1.0000000
## SF-C-SF     -4.036364e-01 -1.736935533  0.92966281 0.9991607
## SF-PF-SF    -3.696970e-02 -0.813654331  0.73971494 1.0000000
## SF-SG-SF    -3.636364e-03 -0.950658504  0.94338578 1.0000000
## SG-SF       -1.143506e-01 -0.283461746  0.05476045 0.5729646
## SG-PG-SF    -3.036364e-01 -1.250658504  0.64338578 0.9984732
## SG-SF-SF    -3.036364e-01 -1.636935533  1.02966281 0.9999645
## SF-PF-SF-C   3.666667e-01 -1.165943951  1.89927728 0.9999374
## SF-SG-SF-C   4.000000e-01 -1.225579041  2.02557904 0.9999137
## SG-SF-C      2.892857e-01 -1.042725865  1.62129729 0.9999795
## SG-PG-SF-C   1.000000e-01 -1.525579041  1.72557904 1.0000000
## SG-SF-SF-C   1.000000e-01 -1.777056994  1.97705699 1.0000000
## SF-SG-SF-PF  3.333333e-02 -1.178301746  1.24496841 1.0000000
## SG-SF-PF    -7.738095e-02 -0.851853156  0.69709125 1.0000000
## SG-PG-SF-PF -2.666667e-01 -1.478301746  0.94496841 0.9999761
## SG-SF-SF-PF -2.666667e-01 -1.799277284  1.26594395 0.9999986
## SG-SF-SG    -1.107143e-01 -1.055922785  0.83449421 1.0000000
## SG-PG-SF-SG -3.000000e-01 -1.627279729  1.02727973 0.9999674
## SG-SF-SF-SG -3.000000e-01 -1.925579041  1.32557904 0.9999970
## SG-PG-SG    -1.892857e-01 -1.134494213  0.75592278 0.9999921
## SG-SF-SG    -1.892857e-01 -1.521297293  1.14272586 0.9999999
## SG-SF-SG-PG  2.109424e-15 -1.625579041  1.62557904 1.0000000

Tukey HSD Test

  • Purpose: The Tukey HSD [3] test is used following an ANOVA when the overall test indicates significant differences. It helps in conducting pairwise comparisons between all possible pairs of group means to specifically identify which positions differ from each other in terms of their average steals per game.

  • Utility: This test provides detailed insights by comparing every position against every other position, adjusting for multiple comparisons to maintain the overall type I error rate. It’s essential for understanding not just if differences exist, but where they exist.

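    • Significant Comparisons (adjusted p-value below 0.05):

      • PG and C: Point guards (PG) average about 0.29 more steals per game than centers (C), with a very low adjusted p-value (0.0000020), indicating strong statistical significance.

      • PG and PF: Point guards also average about 0.28 more steals per game than power forwards (PF), with an adjusted p-value of 0.0000094.

      • SG and PG: Shooting guards (SG) average about 0.22 fewer steals per game than point guards (adjusted p-value 0.0016213).

      • SF and C: Small forwards (SF) average about 0.19 more steals per game than centers (adjusted p-value 0.0172595).

      • SF and PF: Small forwards average about 0.17 more steals per game than power forwards; this difference is only marginally significant (adjusted p-value 0.0474094).

      • No comparison involving a combined position (for example “C-PF” or “SG-SF”) is significant; their confidence intervals are very wide, consistent with those groups containing few players.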
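With 14 position levels the Tukey table has 91 rows, so it helps to extract the significant pairs programmatically rather than by eye. The sketch below shows one way to do that on simulated data; the toy data frame and its group means are assumptions made for illustration, and on the real data the same filter would be applied to tukey_results$Pos.

```r
set.seed(1)

# Simulated stand-in for nba_filtered: three positions, 30 players each
toy <- data.frame(
  Pos = factor(rep(c("C", "PF", "PG"), each = 30)),
  STL = c(rnorm(30, mean = 0.40, sd = 0.1),
          rnorm(30, mean = 0.45, sd = 0.1),
          rnorm(30, mean = 1.00, sd = 0.1))
)

# One-way ANOVA followed by Tukey HSD, as in the analysis above
fit <- aov(STL ~ Pos, data = toy)
tk <- TukeyHSD(fit)$Pos   # matrix with columns diff, lwr, upr, "p adj"

# Keep only the rows whose adjusted p-value is below 0.05
sig <- tk[tk[, "p adj"] < 0.05, , drop = FALSE]

# Sort the significant pairs from smallest to largest adjusted p-value
sig[order(sig[, "p adj"]), , drop = FALSE]
```

Row names follow the pattern "later level-earlier level", so a positive diff in the row PG-C means point guards average more steals than centers.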

library(ggplot2)
library(plotly)


steals_summary = nba_filtered %>%
  dplyr::group_by(Pos) %>%
  dplyr::summarise(
    Mean_STL = mean(STL),
    SE_STL = sd(STL) / sqrt(n())  # Standard Error of the mean
  )


p = ggplot(steals_summary, aes(x = Pos, y = Mean_STL, fill = Pos)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.7) +
  geom_errorbar(aes(ymin = Mean_STL - SE_STL, ymax = Mean_STL + SE_STL),
                width = 0.2, position = position_dodge(0.7)) +
  labs(title = "Average Steals Per Game by Position",
       x = "Position",
       y = "Average Steals Per Game") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for clarity


plotly_obj = ggplotly(p)


plotly_obj

Interpretation of the Bar Plot:

  • Variability and Trends: The bar heights indicate the mean steals per game for each position, showing clear variability across different positions. Positions such as PG (Point Guard) and SF (Small Forward) typically have higher averages, which might reflect their roles involving more perimeter defense and opportunities for steals.

  • Error Bars: The error bars represent the standard error of the mean (the standard deviation divided by the square root of the group size), so they show how precisely each position’s mean is estimated rather than the raw spread among players. Longer error bars can reflect either more variable steals or, as with the combined positions, very few players in the group. A group with a single player has no defined standard error.

  • Strategic Insights: Coaches and team analysts can use this data to focus on training and strategic planning. For example, strengthening the defensive skills of players in positions with lower average steals might improve overall team performance.
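The width of an error bar depends on group size as well as on variability. The sketch below makes that visible by reporting n, the standard deviation, and the standard error side by side for simulated groups of very different sizes. The data and group sizes are assumptions chosen for illustration only.

```r
set.seed(1)

# Simulated steals data: one large group, one small group, one single player
toy <- data.frame(
  Pos = c(rep("PG", 100), rep("SF-SG", 4), "SG-SF"),
  STL = c(rnorm(100, mean = 0.9, sd = 0.4),
          rnorm(4, mean = 0.8, sd = 0.4),
          0.5)
)

# Per-position group size, standard deviation, and standard error
summ <- do.call(rbind, lapply(split(toy$STL, toy$Pos), function(x) {
  data.frame(n = length(x), sd = sd(x), se = sd(x) / sqrt(length(x)))
}))

summ
```

For the single-player group sd() returns NA, so the standard error is NA as well. Including the group size in steals_summary, for example with n = n(), would make such cases easy to spot before plotting.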


Question 3: Predicting player position based on their stats

setwd("C:/Users/racha/Desktop/STAT 515")
nba_data = read.csv("nba.csv")
library(dplyr)
library(ggplot2)
library(randomForest)
library(caret)
library(reshape2)
# Get the unique players
unique_players = unique(nba_data$Player)

# Create a data frame with the unique players
unique_nba_data = nba_data[!duplicated(nba_data$Player), ]

# Position comparison
# Create a bar plot of player positions
ggplot(nba_data, aes(x = Pos)) +
  geom_bar(fill = "blue") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1) +
  theme_minimal() +
  labs(x = "Position", y = "Count", title = "Breakdown of Total Player Positions")

# Create a bar plot of player positions for the unique players
ggplot(unique_nba_data, aes(x = Pos)) +
  geom_bar(fill = "red") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1) +
  theme_minimal() +
  labs(x = "Position", y = "Count", title = "Breakdown of Unique Player Positions")

# Split the data into training and test sets
set.seed(44) 

trainIndex = createDataPartition(nba_data$Pos, p = .7, list = FALSE)
## Warning in createDataPartition(nba_data$Pos, p = 0.7, list = FALSE): Some
## classes have a single record ( PG-SF, SF-C ) and these will be selected for the
## sample
train = nba_data[trainIndex,]
test  = nba_data[-trainIndex,]
train <- na.omit(train)
train$Pos = as.factor(train$Pos)
test$Pos = as.factor(test$Pos)

#summary(nba_data)

# predict 'Pos' using all other variables except 'Player', 'Tm', 'Season', and 'MVP'
model = randomForest(Pos ~ . - Player - Tm - Season - MVP, data = train)
#model
# Use the model to predict player positions in the test set
predictions = predict(model, newdata = test)
# Set the levels of the factor in the test data to match those of the training data
test$Pos <- factor(test$Pos, levels = levels(train$Pos))

# Finally, compare these predictions to the actual positions
cm = confusionMatrix(predictions, test$Pos)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   C C-PF  PF PF-C PF-SF  PG PG-SF PG-SG  SF SF-C SF-PF SF-SG  SG
##      C     325    4 101    0     0   0     0     0   8    0     0     0   3
##      C-PF    1    0   0    0     0   0     0     0   0    0     0     0   0
##      PF    129    2 444    3     2   2     0     0 121    0     1     0  29
##      PF-C    0    0   0    0     0   0     0     0   0    0     0     0   0
##      PF-SF   0    0   1    0     0   0     0     0   0    0     0     0   0
##      PG      0    0   3    0     0 692     0     7  24    0     0     1 127
##      PG-SF   0    0   0    0     0   0     0     0   0    0     0     0   0
##      PG-SG   0    0   0    0     0   1     0     0   0    0     0     0   0
##      SF     23    0 136    1     4  16     0     0 413    0     5     3 146
##      SF-C    0    0   0    0     0   0     0     0   0    0     0     0   0
##      SF-PF   0    0   0    0     0   0     0     0   0    0     0     0   0
##      SF-SG   0    0   0    0     0   0     0     0   1    0     0     0   0
##      SG      5    0  27    0     0 104     0     1 164    0     0     5 525
##      SG-PF   0    0   0    0     0   0     0     0   0    0     0     0   0
##      SG-PG   0    0   0    0     0   0     0     0   0    0     0     0   0
##      SG-SF   0    0   0    0     0   0     0     0   0    0     0     0   0
##           Reference
## Prediction SG-PF SG-PG SG-SF
##      C         0     0     0
##      C-PF      0     0     0
##      PF        0     0     1
##      PF-C      0     0     0
##      PF-SF     0     0     0
##      PG        0     5     1
##      PG-SF     0     0     0
##      PG-SG     0     0     0
##      SF        0     1     3
##      SF-C      0     0     0
##      SF-PF     0     0     0
##      SF-SG     0     0     0
##      SG        1     2     4
##      SG-PF     0     0     0
##      SG-PG     0     0     0
##      SG-SF     0     0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6612          
##                  95% CI : (0.6456, 0.6766)
##     No Information Rate : 0.2288          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5746          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: C Class: C-PF Class: PF Class: PF-C Class: PF-SF
## Sensitivity           0.67288   0.0000000    0.6236    0.000000    0.0000000
## Specificity           0.96312   0.9997239    0.9005    1.000000    0.9997239
## Pos Pred Value        0.73696   0.0000000    0.6049         NaN    0.0000000
## Neg Pred Value        0.95042   0.9983457    0.9074    0.998897    0.9983457
## Prevalence            0.13313   0.0016538    0.1963    0.001103    0.0016538
## Detection Rate        0.08958   0.0000000    0.1224    0.000000    0.0000000
## Detection Prevalence  0.12155   0.0002756    0.2023    0.000000    0.0002756
## Balanced Accuracy     0.81800   0.4998620    0.7621    0.500000    0.4998620
##                      Class: PG Class: PG-SF Class: PG-SG Class: SF Class: SF-C
## Sensitivity             0.8491           NA    0.0000000    0.5650          NA
## Specificity             0.9403            1    0.9997238    0.8833           1
## Pos Pred Value          0.8047           NA    0.0000000    0.5499          NA
## Neg Pred Value          0.9556           NA    0.9977943    0.8895          NA
## Prevalence              0.2246            0    0.0022051    0.2015           0
## Detection Rate          0.1907            0    0.0000000    0.1138           0
## Detection Prevalence    0.2370            0    0.0002756    0.2070           0
## Balanced Accuracy       0.8947           NA    0.4998619    0.7242          NA
##                      Class: SF-PF Class: SF-SG Class: SG Class: SG-PF
## Sensitivity              0.000000    0.0000000    0.6325    0.0000000
## Specificity              1.000000    0.9997237    0.8881    1.0000000
## Pos Pred Value                NaN    0.0000000    0.6265          NaN
## Neg Pred Value           0.998346    0.9975186    0.8907    0.9997244
## Prevalence               0.001654    0.0024807    0.2288    0.0002756
## Detection Rate           0.000000    0.0000000    0.1447    0.0000000
## Detection Prevalence     0.000000    0.0002756    0.2310    0.0000000
## Balanced Accuracy        0.500000    0.4998618    0.7603    0.5000000
##                      Class: SG-PG Class: SG-SF
## Sensitivity              0.000000     0.000000
## Specificity              1.000000     1.000000
## Pos Pred Value                NaN          NaN
## Neg Pred Value           0.997795     0.997519
## Prevalence               0.002205     0.002481
## Detection Rate           0.000000     0.000000
## Detection Prevalence     0.000000     0.000000
## Balanced Accuracy        0.500000     0.500000
#plotting the confusion matrix

# Assuming 'cm' is your confusion matrix
cm_matrix = as.matrix(cm$table)

# Convert the confusion matrix to a data frame
cm_df = as.data.frame(as.table(cm$table))

# Melt the data frame
cm_melt = melt(cm_df)
## Using Prediction, Reference as id variables
# Create the heatmap
ggplot(data = cm_melt, aes(x = Reference, y = Prediction, fill = value)) +
  geom_tile() +
  geom_text(aes(label = value), vjust = 0.5, color = "black") +
  scale_fill_gradient(low = "white", high = "red") +
  theme_minimal() 

Interpretation

The bar charts give an overview of the dataset. The first shows the total number of records per position across the whole dataset. The second counts each player once, using the first record in which that player appears, for reference.

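The model predicts the position of a player based on the majority vote from all the decision trees in the forest. Each tree gives a “vote” for that player’s position, and the position with the most votes becomes the model’s prediction. The confusion matrix and the accompanying statistics show how well the model performs. The overall accuracy is about 66.12% (95% CI: 64.56% to 67.66%), which means the model correctly predicted the position in about 66.12% of the test-set records. This is well above the no-information rate of 22.88%, the accuracy obtained by always predicting the most common position (SG). Performance varies by class: sensitivity is highest for PG (about 0.85) and lowest among the primary positions for SF (about 0.57), while none of the rare combined positions is ever predicted correctly.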
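The accuracy printed in the confusion-matrix output (0.6612) can be checked by hand from the counts in the table. The sketch below copies the diagonal entries and the column totals from that output and recomputes accuracy, the no-information rate, and per-class sensitivity. The vector names are illustrative; the numbers are transcribed from the matrix above.

```r
# Correct predictions: diagonal of the confusion matrix
# (only the five primary positions have non-zero diagonal entries)
correct <- c(C = 325, PF = 444, PG = 692, SF = 413, SG = 525)

# Column totals: number of test records per actual (reference) position
actual <- c(C = 483, "C-PF" = 6, PF = 712, "PF-C" = 4, "PF-SF" = 6,
            PG = 815, "PG-SF" = 0, "PG-SG" = 8, SF = 731, "SF-C" = 0,
            "SF-PF" = 6, "SF-SG" = 9, SG = 830, "SG-PF" = 1,
            "SG-PG" = 8, "SG-SF" = 9)

# Overall accuracy: correct predictions divided by all test records
accuracy <- sum(correct) / sum(actual)

# No-information rate: share of the most common actual position
nir <- max(actual) / sum(actual)

# Sensitivity for the primary positions: correct / actual per class
sensitivity <- correct / actual[names(correct)]

round(accuracy, 4)
round(nir, 4)
round(sensitivity, 4)
```

The test set has 3628 records, of which 2399 are classified correctly, which reproduces the reported accuracy of 0.6612.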